In this data exploration we are looking at the US Real Estate market with use of a dataset from kaggle published by Ahmed Shahriar Sakib. It contains over 2.2 Million Real Estate listings broken down to State, Size, Price (among other factors). (Source: https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data)
Data import
data =read.csv("data/realtor-data.zip.csv") # Data import
Data Transformation
data =subset(data, select =c(status, price, bed, bath, acre_lot, city, state, house_size)) # keep relevant columns
We chose the following data types for our variables:
Data dictionary
Code
tibble(Variable =c("price", "status", "acre_lot", "state", "house_size"),Description =c("The price for which the item was listed on the market","The status if the house is already sold or still for sale","The size of the land / lot on which the house is located in acres","The state in which the house is located","The size of the house in square feet" )) |>kable(caption ="Description of key variables in the dataset",align =c("l", "l") )
Description of key variables in the dataset
Variable
Description
price
The price for which the item was listed on the market
status
The status if the house is already sold or still for sale
acre_lot
The size of the land / lot on which the house is located in acres
state
The state in which the house is located
house_size
The size of the house in square feet
Summary statistic tables
In this section we will cover the summary of our cleaned dataset. We will explore basic statistical values from our data.
map_size <- data %>%group_by(state) %>%summarise(avg_size =mean(house_size, na.rm =TRUE), .groups ="drop") %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_size,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_size,text =~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),colorscale ="Greens",colorbar =list(title ="Avg Size (sqft)")) %>% plotly::layout(title =list(text ="Average House Size by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )
USA Map Extremfälle Price (Karte)
Code
map_extremes <- data %>%group_by(state) %>%summarise(min_price =suppressWarnings(min(price, na.rm =TRUE)),max_price =suppressWarnings(max(price, na.rm =TRUE)),.groups ="drop" ) %>%mutate(range_price = max_price - min_price) %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_extremes,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~range_price,text =~paste0( state,"<br>Min: $", formatC(min_price, big.mark =",", format ="f", digits =0),"<br>Max: $", formatC(max_price, big.mark =",", format ="f", digits =0) ),colorscale ="Reds",colorbar =list(title ="Price Range ($)")) %>% plotly::layout(title =list(text ="Price Extrem Values by U.S. State (Max − Min)"),geo =list(scope ="usa", projection =list(type ="albers usa")) )
Anzahl aktiver Listings pro State
Summary
Instructions
Summarise your finding.
Source Code
---title: "Data Exploration Gabriel Hamulic"subtitle: "Dataset: US Real Estate Dataset"author: "Hamulic, Gabriel"date: todayembed-resources: trueformat: html: output-file: US-Real-Estate_Gabriel_Hamulic.html #output-ext: "html.html" toc: true toc-location: right code-link: true code-tools: true #df-print: kable theme: light: flatly dark: darkly #echo: fenced pdf: output-file: US-Real-Estate_Gabriel_Hamulic.pdf toc: true number-sections: true code-link: true df-print: tibble crossref: lof-title: "List of Figures"fig-align: centerexecute: warning: false---\listoffigures \listoftables\listoflistings::: callout-caution### InstructionsYour report must be of high quality, meaning that your report:- is visually and textually pleasing of- does not look/read/feel like a draft instead of a finished analysis- explains/discusses your findings and results in the main text, e.g., explain/discuss all figures/table in the main text- is representable such that it can show to any interested third party- uses figure/table captions/linking/reference (see example further down)- Do not show any standard printout of R-code, use for data.frame/tibbles `knitr::kable()` printing.- Do not simply print datasets (too many lines) use instead `rmarkdown::paged_table()`:::{{< pagebreak >}}# Introduction## Libraries```{r}#| code-summary: Libraries#| code-fold: truelibrary <-function(...) {suppressPackageStartupMessages(base::library(...))}library(tidyverse)library(dplyr)library(knitr)library(tidyr)library(rmarkdown)library(janitor)library(scales)library(tidytext)library(ggforce)library(GGally)library(DT)library(kableExtra)```# Data## Data sourceIn this data exploration we are looking at the US Real Estate market with use of a dataset from kaggle published by Ahmed Shahriar Sakib. It contains over 2.2 Million Real Estate listings broken down to State, Size, Price (among other factors). (Source: <https://www.kaggle.com/datasets/ahmedshahriarsakib/usa-real-estate-dataset/data>)## Data import```{r setup, include=FALSE}options(dplyr.print_max = 15, dplyr.print_min = 10)``````{r}data =read.csv("data/realtor-data.zip.csv") # Data import```## Data Transformation```{r}data =subset(data, select =c(status, price, bed, bath, acre_lot, city, state, house_size)) # keep relevant columns```### Structure with standard datatypes```{r}#| code-fold: truestr_out <-capture.output(str(data))kable(data.frame(Output = str_out), col.names ="Structure Output")``````{r}#| code-fold: true# Assign Data Typesdata$status =as.factor(data$status)data$city =as.factor(data$city)data$state =as.factor(data$state)```### NA Removal```{r}#| code-fold: truebefore_rows <-nrow(data)data <-na.omit(data)after_rows <-nrow(data)kable(data.frame(Description =c("Before NA removal", "After NA removal"),Rows =c(before_rows, after_rows)))```The dataset now has `r nrow(data)` observations and `r ncol(data)` variables after removing rows with missing values.### Filtering```{r}#| code-fold: true# Filter min and max valuesdata = data |>filter(price >10000& price <1000000000)```### Calculations```{r}#| code-fold: truedata = data |>mutate(price_per_sqm = price/house_size)```### Cleaned Dataset```{r}#| code-fold: truepaged_table(data)```### Structure after transformation```{r}#| code-fold: truestr_out <-capture.output(str(data))kable(data.frame(Output = str_out), col.names ="Structure Output")```We chose the following data types for our variables:- ## Data dictionary```{r}#| code-fold: truetibble(Variable =c("price", "status", "acre_lot", "state", "house_size"),Description =c("The price for which the item was listed on the market","The status if the house is already sold or still for sale","The size of the land / lot on which the house is located in acres","The state in which the house is located","The size of the house in square feet" )) |>kable(caption ="Description of key variables in the dataset",align =c("l", "l") )```# Summary statistic tablesIn this section we will cover the summary of our cleaned dataset. We will explore basic statistical values from our data.## Numeric Statistics### Summary of numerical values```{r}#| label: "Numeric Statistics"#| tbl-cap: "Summary statistics of numerical variables in dataframe"#| code-fold: truedata |> janitor::clean_names() |>mutate(row =row_number() |>factor()) |>pivot_longer(cols =where(is.numeric)) |>group_by(name) |>summarize(N =n(),min =min(value),mean =mean(value),median =median(value),max =max(value),st.dev =sd(value) ) |> knitr::kable(digits =2)```### Visualisation of numerical values```{r}#| label: "Logarithmic Visualisation"#| tbl-cap: "Visualisation of numerical variables in dataframe"#| code-fold: truedata |>clean_names() |>pivot_longer(cols =where(is.numeric)) |>ggplot(aes(x = value, fill = name)) +geom_histogram(bins =30, alpha =0.7, color ="white") +scale_x_log10(labels =label_comma()) +# 👈 echte Werte, log-Skalafacet_wrap(~ name, scales ="free_x") +theme_minimal() +labs(title ="Distribution of Numerical Variables (logarithmic scale)",x ="Value",y ="Count" ) +theme(legend.position ="none",axis.text.x =element_text(angle =25, hjust =1) )```## Nominal Statistics### Summary of nominal variables (top categories)```{r}#| label: "Nominal Statistics"#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"#| code-fold: truetop_n_per_var <-10nominal_summary <- data |>clean_names() |>select(where(is.factor), price) |>pivot_longer(cols =where(is.factor),names_to ="Variable",values_to ="Category") |>group_by(Variable, Category) |>summarise(Count =n(),Percent =round(100* Count /nrow(data), 2),Mean_Price =round(mean(price, na.rm =TRUE), 0),.groups ="drop" ) |>group_by(Variable) |>slice_max(order_by = Count, n = top_n_per_var, with_ties =FALSE) |>ungroup()kable( nominal_summary,caption =paste0("Top ", top_n_per_var," categories per factor variable (counts, share %, and mean price)" ),digits =2,align =c("l", "l", "r", "r", "r"))```### Visualisation of nominal variables (top categories)```{r}#| label: "Visualization Nominal Statistics"#| tbl-cap: "Top categories for factor variables with counts, proportions, and mean price"#| code-fold: truenominal_summary <- nominal_summary |>group_by(Variable) |>mutate(Category = forcats::fct_reorder(Category, Count),Category =factor(Category, levels =unique(Category))) |>ungroup()# Plot: Facets untereinander, mit eigener x-Skala und y-Skala pro Variableggplot(nominal_summary, aes(x = Count, y = Category, fill = Variable)) +geom_col(show.legend =FALSE, alpha =0.8, width =0.7) +facet_wrap(~ Variable, ncol =1, scales ="free", drop =TRUE) +scale_x_continuous(labels =label_comma()) +#Tausendertrennung, keine 1e+05theme_minimal() +labs(title ="Top Categories per Factor Variable",x ="Count",y ="Category" ) +theme(panel.spacing.y =unit(1, "lines"),strip.text =element_text(size =12, face ="bold"),axis.text.y =element_text(size =8),plot.margin =margin(5, 15, 5, 5) )```# Bivariate Analysis### Pairs Plot (all numeric variables)```{r}#| label: "Pairs Plot"#| code-fold: trueset.seed(123)data_num <- data |> janitor::clean_names() |>select(where(is.numeric)) |>slice_sample(n =3000) |>mutate(across(everything(), log1p)) p <-ggpairs( data_num,progress =FALSE,upper =list(continuous =wrap("cor", size =4, alignPercent =0.8, stars =TRUE)),lower =list(continuous =wrap("points", alpha =0.3, size =0.7)),diag =list(continuous =wrap("densityDiag", alpha =0.7)))p +theme_minimal(base_size =11) +theme(strip.text =element_text(size =8, face ="bold"),panel.grid =element_blank(),axis.text =element_text(size =8),axis.title =element_text(size =9),plot.title =element_text(face ="bold", size =14, hjust =0.5) ) +labs(title ="Scatterplot-Matrix (log-transformiert, n=3000)")```### Price vs. house size by listing status```{r}#| label: "scatter_price_size"#| tbl-cap: "Price vs. house size by listing status"#| code-fold: truelibrary(dplyr)library(plotly)# Stichprobe ziehen für Performanceset.seed(123)sample_data <- data %>%sample_n(50000)plot_ly( sample_data,x =~house_size,y =~price,color =~status,type ="scatter",mode ="markers",alpha =0.6) %>% plotly::layout(title =list(text ="Relationship Between House Size and Price by Status"),xaxis =list(title ="House Size (sqft)"),yaxis =list(title ="Price ($)", type ="log") )```### Multiple Regression```{r}#| code-fold: truemodel <-lm(price ~ house_size + bath + bed + state + status, data = data)library(broom)tidy(model) |>arrange(p.value) |>mutate(estimate =round(estimate, 1),std.error =round(std.error, 1),statistic =round(statistic, 1),p.value =signif(p.value, 3) ) |>datatable(caption ="Regressionsergebnisse (interaktiv)",filter ="top", options =list(pageLength =10, autoWidth =TRUE,responsive =TRUE ) )``````{r}#| label: "Graphic Regression"#| code-fold: true#| tidy(model, conf.int =TRUE) |>filter(term !="(Intercept)") |>mutate(term =reorder(term, estimate)) |>ggplot(aes(x = estimate, y = term, fill = estimate >0)) +geom_col(show.legend =FALSE) +geom_vline(xintercept =0, linetype ="dashed") +theme_minimal() +labs(title ="Greatest contributors to price",x ="regression coefficient",y ="" )```### USA Map – Average Price```{r}#| label: "state_mapping"#| code-fold: truelibrary(dplyr)library(tibble)valid_states <-tibble(state_name =c(state.name, "District of Columbia"),state_abbr =c(state.abb, "DC"))``````{r}#| label: "map_avg_price"#| tbl-cap: "Average property price by U.S. state"#| code-fold: truelibrary(dplyr)library(plotly)map_price <- data |>group_by(state) %>%summarise(avg_price =mean(price, na.rm =TRUE), .groups ="drop") |>inner_join(valid_states, by =c("state"="state_name")) |>mutate(avg_price_k = avg_price /1000)plot_ly( map_price,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_price_k,text =~paste0(state, "<br>Avg Price: $", round(avg_price_k, 1), "K"),colorscale ="Blues",colorbar =list(title ="Avg Price ($K)")) |> plotly::layout(title =list(text ="Average Property Price by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### USA Map AVG Size (Karte)```{r}#| label: "map_avg_size"#| tbl-cap: "Average house size by U.S. state"#| code-fold: truemap_size <- data %>%group_by(state) %>%summarise(avg_size =mean(house_size, na.rm =TRUE), .groups ="drop") %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_size,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~avg_size,text =~paste0(state, "<br>Avg Size: ", round(avg_size), " sqft"),colorscale ="Greens",colorbar =list(title ="Avg Size (sqft)")) %>% plotly::layout(title =list(text ="Average House Size by U.S. State"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### USA Map Extremfälle Price (Karte)```{r}#| label: "map_extreme_price"#| tbl-cap: "Price range (max − min) by U.S. state"#| code-fold: truemap_extremes <- data %>%group_by(state) %>%summarise(min_price =suppressWarnings(min(price, na.rm =TRUE)),max_price =suppressWarnings(max(price, na.rm =TRUE)),.groups ="drop" ) %>%mutate(range_price = max_price - min_price) %>%inner_join(valid_states, by =c("state"="state_name"))plot_ly( map_extremes,type ="choropleth",locationmode ="USA-states",locations =~state_abbr,z =~range_price,text =~paste0( state,"<br>Min: $", formatC(min_price, big.mark =",", format ="f", digits =0),"<br>Max: $", formatC(max_price, big.mark =",", format ="f", digits =0) ),colorscale ="Reds",colorbar =list(title ="Price Range ($)")) %>% plotly::layout(title =list(text ="Price Extrem Values by U.S. State (Max − Min)"),geo =list(scope ="usa", projection =list(type ="albers usa")) )```### Anzahl aktiver Listings pro State# Summary::: callout-caution### InstructionsSummarise your finding.:::